Xet Storage Not Deduplicating for Even Simple Binary Files

I have migrated to Xet storage, and today I tried to test whether Xet is really working.

My test is simple: generate an all-ones (int) array with NumPy and upload it to Hugging Face.

import numpy as np
a = np.ones(10000000, dtype=int)  # 10M ones; ~40 MB on disk with a 4-byte default int (Windows)
np.save("./one.npy", a)

And upload it:

pip install -U "huggingface_hub[cli,hf_xet]"
huggingface-cli.exe upload lyk/XetTest . --repo-type=dataset
Start hashing 1 files.
Finished hashing 1 files.
Uploading files using Xet Storage..

It shows that I am using Xet, but in the end the LFS storage reads 40MB, just as large as the raw file itself, so no deduplication.
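For what it’s worth, the per-file metadata exposed by the API only reports the logical file size, not what is physically stored after deduplication. A minimal sketch of that check, assuming a recent huggingface_hub:

from huggingface_hub import HfApi

api = HfApi()
# files_metadata=True fills in per-file LFS info (size, sha256)
info = api.repo_info("lyk/XetTest", repo_type="dataset", files_metadata=True)
for f in info.siblings:
    print(f.rfilename, f.size, f.lfs)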

Well, maybe it only deduplicates across commit history. So I generated a file twice as large:

import numpy as np
a = np.ones(20000000, dtype=int)  # twice as many ones; ~80 MB
np.save("./one.npy", a)

Then I uploaded it and got 120MB of LFS storage usage.

And during the whole process, the progress bar in the terminal showed that I uploaded the files in full (40MB and 80MB), even though Xet is enabled.

I don’t know why Xet does not work. Is anything wrong here?

I think it’s probably just a bug, but I’m not sure where to report it…

And something even stranger: no LFS file is removed after a super squash, even though all the history is removed. I have only seen this in my test repo; super squash works fine in my other repos.

Well, I did do more than just upload a bigger file: I uploaded one.npy, updated it, and then uploaded the original one.npy again. Maybe that is the reason.
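For reference, a super squash can also be run from the Python API; a minimal sketch, assuming a huggingface_hub version that ships HfApi.super_squash_history:

from huggingface_hub import HfApi

api = HfApi()
# Collapse the branch's entire commit history into a single commit;
# this rewrites history but does not by itself promise immediate storage reclaim
api.super_squash_history(repo_id="lyk/XetTest", repo_type="dataset")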

Xet Storage Not Deduplicating for Even Simple Binary Files · Issue #3090 · huggingface/huggingface_hub

Xet Storage Not Deduplicating for Even Simple Binary Files · Issue #343 · huggingface/xet-core

xet-team/README · Xet Storage Not Deduplicating for Even Simple Binary Files

I see.

From GitHub, rajatarya wrote:

The file size shown when uploading will reflect the total file size for the file, not the deduplicated file size. The experience of deduplication will be that the time necessary for completing the file upload is less, due to fewer bytes traveling on the wire.

Deduplication will occur across all files, not just across commit history.
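That chunk-level behaviour is easy to see on this exact test: an all-ones file is maximally repetitive, so almost every chunk hashes to the same digest. A toy illustration with fixed-size chunks (Xet itself uses content-defined chunking; the 64 KB chunk size here is arbitrary):

import hashlib

import numpy as np

def chunk_digests(data: bytes, chunk_size: int = 64 * 1024):
    # Hash fixed-size chunks; identical chunks share one digest,
    # so a content-addressed store keeps and ships them only once.
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

small = np.ones(10_000_000, dtype=np.int32).tobytes()  # ~40 MB of ones
large = np.ones(20_000_000, dtype=np.int32).tobytes()  # ~80 MB of ones

digests = chunk_digests(small) + chunk_digests(large)
print(len(digests), "chunks in total,", len(set(digests)), "unique")

Both files together collapse to a handful of unique chunks, which is why the upload finishes quickly even though the progress bar counts the full 40MB and 80MB.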

Does this mean that a frequently appended Parquet file will not be deduplicated at the block level?

I wonder.

I think Rajat answered the question of appending to Parquet files here, but just to reiterate: Yes, if you’re appending to a Parquet file and uploading it, only the new chunks will need to be transferred (other content will be deduplicated if it has already been uploaded).
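A minimal sketch of that append-and-reupload flow with pyarrow (the file name, repo name, and column are placeholders; whether the old row groups deduplicate byte-for-byte depends on the writer emitting them unchanged):

import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi

# Parquet files cannot be appended to in place, so rewrite the file
# with the extra rows at the end.
table = pq.read_table("data.parquet")
new_rows = pa.table({"value": [4, 5, 6]})  # must match the existing schema
pq.write_table(pa.concat_tables([table, new_rows]), "data.parquet")

# Re-upload the same path: the progress bar still shows the full file
# size, but chunks already on the server are skipped on the wire.
HfApi().upload_file(
    path_or_fileobj="data.parquet",
    path_in_repo="data.parquet",
    repo_id="lyk/XetTest",
    repo_type="dataset",
)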
